237 research outputs found

    An empirical analysis of information filtering methods

    The growth in the number of news articles, blogs, images, and videos available on the Web is making it more challenging for people to find potentially useful information. People have relied on search engines to satisfy their short-term needs, such as finding the telephone number for a restaurant; however, these systems have not been designed to support long-term needs, such as the research interests of academics. One approach to supporting long-term needs is to use an Information Filtering system to select potentially useful information from the vast amount being produced every day. The similarities between Information Retrieval systems and Information Filtering systems are well established and have prompted the use of retrieval models and methods in filtering systems, which has had some success but has also been criticised as a limiting factor, given the unique challenges of document filtering. A significant difference between these systems is the use case: a filtering system is intended to push information to the user over a period of time, whereas a retrieval system is intended for the user to pull information to themselves for immediate use. The main challenges that a filtering system needs to address are the transient nature of the information published on the Web and the drifting nature of information needs. These factors lead to an uncertain interplay between the components comprising a filtering system, and this thesis presents an empirical analysis of how the main system components affect performance. The analysis explores the role of each system component independently and in conjunction with other components. The main contribution of this thesis is a deeper understanding of how different components affect performance and of the interplay between these components. The outcome is intended to act as a guide for both practitioners and researchers interested in overcoming some of the challenges of building filtering systems.
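
    The abstract above describes a push-style system: score each incoming document against a long-term user profile, deliver it if the score clears a threshold, and let the profile drift with relevance feedback. The Python sketch below is purely illustrative and not taken from the thesis; the bag-of-words profile, cosine scorer, threshold, and feedback-driven update rule are all assumed stand-ins for whatever components a real filtering system would use.

    from collections import Counter
    import math

    def cosine(a, b):
        # Cosine similarity between two bag-of-words Counters.
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    def filter_stream(documents, profile, get_feedback, threshold=0.2, rate=0.5):
        # Push documents that match the profile; drift the profile towards
        # documents the user marks as relevant (a toy model of need drift).
        delivered = []
        for text in documents:
            terms = Counter(text.lower().split())
            if cosine(terms, profile) >= threshold:
                delivered.append(text)
                if get_feedback(text):
                    for term, count in terms.items():
                        profile[term] += rate * count
        return delivered

    profile = Counter("image captioning dataset neural".split())
    docs = ["new image captioning dataset released", "restaurant opening hours"]
    print(filter_stream(docs, profile, get_feedback=lambda d: True))

    The threshold and update rate in this toy loop control the precision/recall trade-off and how quickly the profile adapts, which is one concrete instance of the component interplay the thesis analyses.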

    The Role of Syntactic Planning in Compositional Image Captioning

    Image captioning has focused on generalizing to images drawn from the same distribution as the training set, and not to the more challenging problem of generalizing to different distributions of images. Recently, Nikolaus et al. (2019) introduced a dataset to assess compositional generalization in image captioning, where models are evaluated on their ability to describe images with unseen adjective-noun and noun-verb compositions. In this work, we investigate different methods to improve compositional generalization by planning the syntactic structure of a caption. Our experiments show that jointly modeling tokens and syntactic tags enhances generalization in both RNN- and Transformer-based models, while also improving performance on standard metrics. Comment: Accepted at EACL 202
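
    One simple way to realise the joint modelling of tokens and syntactic tags described above is to interleave tags with words in the decoder's target sequence, so that a standard captioning decoder must predict the syntactic category of each word before emitting it. The sketch below is a hedged illustration of that idea rather than the authors' implementation; the tagset and helper functions are assumptions for the example.

    def interleave(tokens, tags):
        # Build a target sequence in which each word is preceded by its tag,
        # e.g. ['a', 'dog'] + ['DET', 'NOUN'] -> ['DET', 'a', 'NOUN', 'dog'].
        assert len(tokens) == len(tags)
        target = []
        for tag, tok in zip(tags, tokens):
            target.extend([tag, tok])
        return target

    def strip_tags(sequence, tagset):
        # Recover the plain caption from an interleaved model prediction.
        return [symbol for symbol in sequence if symbol not in tagset]

    tagset = {"DET", "ADJ", "NOUN", "VERB", "ADP"}
    caption = ["a", "small", "dog", "chases", "sheep"]
    tags = ["DET", "ADJ", "NOUN", "VERB", "NOUN"]
    planned = interleave(caption, tags)
    print(planned)
    print(strip_tags(planned, tagset))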

    Lessons learned in multilingual grounded language learning

    Recent work has shown how to learn better visual-semantic embeddings by leveraging image descriptions in more than one language. Here, we investigate in detail which conditions affect the performance of this type of grounded language learning model. We show that multilingual training improves over bilingual training, and that low-resource languages benefit from training with higher-resource languages. We demonstrate that a multilingual model can be trained equally well on either translations or comparable sentence pairs, and that annotating the same set of images in multiple languages enables further improvements via an additional caption-caption ranking objective. Comment: CoNLL 201
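
    The additional caption-caption ranking objective mentioned above can be pictured as a second max-margin term alongside the usual image-caption term: captions of the same image in different languages should embed closer to each other than to captions of other images. The NumPy sketch below is an assumed, minimal illustration of such a ranking loss with random placeholder embeddings; the real model learns the encoders end to end.

    import numpy as np

    def margin_rank_loss(anchor, positive, negatives, margin=0.2):
        # Max-margin ranking: the positive should be closer to the anchor
        # than every negative by at least `margin` (cosine similarity).
        def cos(u, v):
            return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        pos = cos(anchor, positive)
        return sum(max(0.0, margin - pos + cos(anchor, n)) for n in negatives)

    rng = np.random.default_rng(0)
    image, cap_en, cap_de = (rng.normal(size=128) for _ in range(3))
    negatives = [rng.normal(size=128) for _ in range(5)]

    # Image-caption term plus the extra caption-caption term across languages.
    loss = margin_rank_loss(image, cap_en, negatives) + \
           margin_rank_loss(cap_en, cap_de, negatives)
    print(loss)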

    The Sensitivity of Language Models and Humans to Winograd Schema Perturbations

    Large-scale pretrained language models are the major driving force behind recent improvements in performance on the Winograd Schema Challenge, a widely employed test of common sense reasoning ability. We show, however, with a new diagnostic dataset, that these models are sensitive to linguistic perturbations of the Winograd examples that minimally affect human understanding. Our results highlight interesting differences between humans and language models: language models are more sensitive to number or gender alternations and synonym replacements than humans, and humans are more stable and consistent in their predictions, maintain a much higher absolute performance, and perform better on non-associative instances than associative ones. Overall, humans are correct more often than out-of-the-box models, and the models are sometimes right for the wrong reasons. Finally, we show that fine-tuning on a large, task-specific dataset can offer a solution to these issues. Comment: ACL 202
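
    A common way to apply a language model to Winograd schemas, and the setting in which perturbations such as synonym replacement or number/gender alternation can be tested, is to substitute each candidate referent into the sentence and keep the one the model scores as more likely. The sketch below is an assumed illustration: `logprob` stands in for any sentence-level log-probability function from a pretrained model, and the toy lambda in the example is not a real model.

    def resolve_winograd(template, candidates, logprob):
        # Substitute each candidate into the schema and pick the one the
        # language model assigns the highest log-probability.
        scored = {c: logprob(template.format(candidate=c)) for c in candidates}
        return max(scored, key=scored.get), scored

    template = ("The trophy doesn't fit into the suitcase because "
                "the {candidate} is too large.")
    perturbed = template.replace("suitcase", "bag")  # a synonym perturbation

    toy_lm = lambda sentence: -len(sentence)  # placeholder, not a real LM
    print(resolve_winograd(template, ["trophy", "suitcase"], toy_lm))
    print(resolve_winograd(perturbed, ["trophy", "bag"], toy_lm))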

    Retrieval-augmented Image Captioning

    Inspired by retrieval-augmented language generation and pretrained Vision and Language (V&L) encoders, we present a new approach to image captioning that generates sentences given the input image and a set of captions retrieved from a datastore, as opposed to the image alone. The encoder in our model jointly processes the image and retrieved captions using a pretrained V&L BERT, while the decoder attends to the multimodal encoder representations, benefiting from the extra textual evidence from the retrieved captions. Experimental results on the COCO dataset show that image captioning can be effectively formulated from this new perspective. Our model, named EXTRA, benefits from using captions retrieved from the training dataset, and it can also benefit from using an external dataset without the need for retraining. Ablation studies show that retrieving a sufficient number of captions (e.g., k=5) can improve captioning quality. Our work contributes towards using pretrained V&L encoders for generative tasks, instead of standard classification tasks.
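
    The retrieval step the abstract describes can be pictured as a nearest-neighbour lookup over a datastore of captioned images: embed the query image, find the k most similar entries, and hand their captions to the multimodal encoder as extra textual evidence. The sketch below only shows that lookup with placeholder embeddings; the encoder/decoder side (the pretrained V&L BERT used by EXTRA) is not reproduced here.

    import numpy as np

    def retrieve_captions(image_vec, store_vecs, store_captions, k=5):
        # Return the captions of the k datastore images most similar to the
        # query image (cosine similarity over L2-normalised embeddings).
        store = store_vecs / np.linalg.norm(store_vecs, axis=1, keepdims=True)
        query = image_vec / np.linalg.norm(image_vec)
        top = np.argsort(-(store @ query))[:k]
        return [store_captions[i] for i in top]

    rng = np.random.default_rng(0)
    store_vecs = rng.normal(size=(1000, 256))        # placeholder image embeddings
    store_captions = [f"caption {i}" for i in range(1000)]
    retrieved = retrieve_captions(rng.normal(size=256), store_vecs, store_captions, k=5)
    print(retrieved)  # these would be concatenated with the image features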

    Towards Succinct and Relevant Image Descriptions

    What does it mean to produce a good description of an image? Is a description good because it correctly identifies all of the objects in the image, because it describes the interesting attributes of the objects, or because it is short, yet informative? Grice’s Cooperative Principle, stated as “Make your contribution such as is required, at the stage at which it occurs, by the accepted purpose or direction of the talk exchange in which you are engaged” (Grice, 1975), alongside other ideas of pragmatics in communication, has proven useful in thinking about language generation (Hovy, 1987; McKeown et al., 1995). The Cooperative Principle provides one possible framework for thinking about the generation and evaluation of image descriptions. The immediate question is whether automatic image description is within the scope of the Cooperative Principle. Consider the task of searching for images using natural language, where the purpose of the exchange is for the user to quickly and accurately find images that match their information needs. In this scenario, the user formulates a complete sentence query to express their needs, e.g. A sheepdog chasing sheep in a field, and initiates an exchange with the system in the form of a sequence of one-shot conversations. In this exchange, both participants can describe images in natural language, and a successful outcome relies on each participant succinctly and correctly expressing their beliefs about the images.

    Structured representation of images for language generation and image retrieval

    A photograph typically depicts an aspect of the real world, such as an outdoor landscape, a portrait, or an event. The task of creating abstract digital representations of images has received a great deal of attention in the computer vision literature because it is rarely useful to work directly with the raw pixel data. The challenge of working with raw pixel data is that small changes in lighting can result in different digital images, which is not typically useful for downstream tasks such as object detection. One approach to representing an image is automatically extracting and quantising visual features to create a bag-of-terms vector. The bag-of-terms vector helps overcome the problems with raw pixel data but this unstructured representation discards potentially useful information about the spatial and semantic relationships between the parts of the image.

    The central argument of this thesis is that capturing and encoding the relationships between parts of an image will improve the performance of extrinsic tasks, such as image description or search. We explore this claim in the restricted domain of images representing events, such as riding a bicycle or using a computer.

    The first major contribution of this thesis is the Visual Dependency Representation: a novel structured representation that captures the prominent region–region relationships in an image. The key idea is that images depicting the same events are likely to have similar spatial relationships between the regions contributing to the event. This representation is inspired by dependency syntax for natural language, which directly captures the relationships between the words in a sentence. We also contribute a data set of images annotated with multiple human-written descriptions, labelled image regions, and gold-standard Visual Dependency Representations, and explain how the gold-standard representations can be constructed by trained human annotators.

    The second major contribution of this thesis is an approach to automatically predicting Visual Dependency Representations using a graph-based statistical dependency parser. A dependency parser is typically used in Natural Language Processing to automatically predict the dependency structure of a sentence. In this thesis we use a dependency parser to predict the Visual Dependency Representation of an image because we are working with a discrete image representation – that of image regions. Our approach can exploit features from the region annotations and the description to predict the relationships between objects in an image. In a series of experiments using gold-standard region annotations, we report significant improvements in labelled and unlabelled directed attachment accuracy over a baseline that assumes there are no relationships between objects in an image.

    Finally, we find significant improvements in two extrinsic tasks when we represent images as Visual Dependency Representations predicted from gold-standard region annotations. In an image description task, we show significant improvements in automatic evaluation measures and human judgements compared to state-of-the-art models that use either external text corpora or region proximity to guide the generation process. In the query-by-example image retrieval task, we show a significant improvement in Mean Average Precision and the precision of the top 10 images compared to a bag-of-terms approach.

    We also perform a correlation analysis of human judgements against automatic evaluation measures for the image description task. The automatic measures are standard measures adopted from the machine translation and summarization literature. The main finding of the analysis is that unigram BLEU is less correlated with human judgements than Smoothed BLEU, Meteor, or skip-bigram ROUGE.
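
    The correlation analysis described in the last paragraph can be reproduced in outline with standard tooling. The sketch below is a hedged example, not the thesis code: it assumes tokenised system captions, lists of tokenised reference captions, and per-caption human judgement scores, and it compares sentence-level unigram BLEU against Smoothed BLEU via Spearman correlation. Meteor and skip-bigram ROUGE would slot into the same loop through their own scorers.

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
    from scipy.stats import spearmanr

    def metric_scores(hypotheses, references):
        # Sentence-level unigram BLEU and Smoothed BLEU for each hypothesis.
        smooth = SmoothingFunction().method3
        unigram, smoothed = [], []
        for hyp, refs in zip(hypotheses, references):
            unigram.append(sentence_bleu(refs, hyp, weights=(1, 0, 0, 0)))
            smoothed.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
        return unigram, smoothed

    def correlate_with_humans(hypotheses, references, human_scores):
        # Spearman correlation of each metric with the human judgements.
        unigram, smoothed = metric_scores(hypotheses, references)
        return (spearmanr(unigram, human_scores).correlation,
                spearmanr(smoothed, human_scores).correlation)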